Lin Chen
Part 1. Introduction
Part 2. Data Wrangling
Part 3. Data Analysis
Part 4. Conclusion
As newly admitted graduate students at the University of Calgary, we consider safety one of our most significant concerns, and we have many questions about daily life and public safety, such as:
Where to live?
Where to go shopping?
When and where to hang out at night…
Therefore, a detailed study of community crime statistics is both reasonable and useful: it helps new residents get to know the city and make safe choices.
All our guiding questions can be divided into four categories:
1. Time-related:
Which time of year is relatively safe?
Has Calgary become a safer place in recent years?
2. Location-related:
Which area in Calgary is relatively safe?
Which area in Calgary is relatively unsafe?
How unsafe is the downtown area, and what is the major type of crime there?
3. Crime-related:
What are general categories for crime records in Calgary?
What kind of crime happened most frequently?
How can we correctly measure safety level by using crime data?
4. Advanced-level:
Can I have a crime map? Preferably an interactive one.
Can I use animation to show how safety level changes by both location and time?
Python Libraries Used:
NumPy, pandas, Matplotlib, GeoPandas, Geoplot, Plotly, Folium, etc.
Step 1. Data Gathering
Step 2. Data Assessing
Step 3. Data Cleansing
Step 4. Data Storing
Data Source: https://data.calgary.ca/Health-and-Safety/Community-Crime-and-Disorder-Statistics-to-be-arch/848s-4m4z
Click 'Export' at the top right of the webpage and choose the 'CSV' format to download. This tabular dataset includes the category, time, location and other information for Calgary crimes from 2012 to 2019, and it needs to be cleaned.
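A minimal sketch of loading the export and taking a first look with pandas. The two inline rows only mimic the column layout of the raw file so that the snippet runs on its own; in practice the path of the downloaded CSV (whatever name the portal gave it) would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Stand-in for the downloaded file: two rows in the raw export's column layout.
# Replace `sample_csv` with the path of the real CSV (file name is an assumption).
sample_csv = io.StringIO(
    "Sector,Community Name,Group Category,Category,Crime Count,Resident Count,"
    "Date,Year,Month,ID,Community Center Point\n"
    "NORTH,THORNCLIFFE,Crime,Theft FROM Vehicle,9,8474,03/01/2018 12:00:00 PM,"
    '2018,MAR,2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9,"(51.103099554741, -114.068779421169)"\n'
    "SOUTH,WOODBINE,Crime,Theft FROM Vehicle,3,8866,11/01/2019 12:00:00 AM,"
    '2019,NOV,2019-NOV-WOODBINE-Theft FROM Vehicle-3,"(50.939610852207664, -114.12962865374453)"\n'
)

raw = pd.read_csv(sample_csv)

print(raw.shape)         # number of rows and columns
print(raw.dtypes)        # shows that Date is still read as text
print(raw.isna().sum())  # missing values per column
```

The dtype and missing-value summaries are a quick way to spot the kinds of quality issues discussed in the assessment step.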
Data License: https://data.calgary.ca/stories/s/Open-Calgary-Terms-of-Use/u45n-7awa
Free to copy, modify, publish, translate, adapt, distribute or use the Information in any medium.
Identify Data Quality Issues:
Edit Columns:
1) column ID is just a combination of crime information from other columns.
2) a few column names are too verbose.
Wrong Datatypes:
1) column Date is not stored in a proper datetime format.
2) column Community Center Point is in the form (a, b) and cannot be used directly.
Missing Values:
1) in row No. 4022, the crime category is missing.
2) column Resident Count has many values of 0, some of which are probably missing data.
Data Conflicts:
1) some communities were categorized into different sectors in different years.
Solve Data Quality Issues (Part 1):
Edit Columns:
1) drop the column of ID.
2) rename two-word column names to single-word names for convenience.
Wrong Datatypes:
1) correct datatype for column Date.
2) split column Community Center Point into two float columns, Latitude and Longitude.
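The column edits and datatype fixes listed so far could be sketched as follows. This is an illustration on two sample rows, not the exact code used for the project; the date format string is inferred from the raw values, and the regular expression for the coordinates is an assumption about how the (a, b) text is laid out.

```python
import pandas as pd

# Two sample rows with only the columns needed for this illustration.
raw = pd.DataFrame({
    "Community Name": ["THORNCLIFFE", "WOODBINE"],
    "Crime Count": [9, 3],
    "Resident Count": [8474, 8866],
    "Date": ["03/01/2018 12:00:00 PM", "11/01/2019 12:00:00 AM"],
    "ID": ["2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9",
           "2019-NOV-WOODBINE-Theft FROM Vehicle-3"],
    "Community Center Point": ["(51.103099554741, -114.068779421169)",
                               "(50.939610852207664, -114.12962865374453)"],
})

# Drop the redundant ID column and shorten the two-word column names.
clean = raw.drop(columns="ID").rename(columns={
    "Community Name": "Community",
    "Crime Count": "Crimes",
    "Resident Count": "Residents",
})

# Parse the timestamp text into real datetimes (format inferred from the samples).
clean["Date"] = pd.to_datetime(clean["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Pull the two numbers out of "(lat, lon)" and store them as floats.
coords = clean.pop("Community Center Point").str.extract(
    r"\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\)"
)
clean["Latitude"] = coords[0].astype(float)
clean["Longitude"] = coords[1].astype(float)

print(clean)
```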
Missing Values:
1) for row No.4022, put 'unknown' as crime category.
2) column Resident Count has many values of 0; recover the missing values as far as possible.
For example, crimes can happen in areas without residents, such as wildland. Industrial areas and even the airport have no residents, since people work there but do not live there, so a count of 0 may be genuine. The solution is to 'take the maximum': if community A has five resident-count records for October 2018, [50, 50, 50, 49, 0], then all of them are changed to 50. If the five records are [0, 0, 0, 0, 0], the resident count stays 0.
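The 'take the maximum' rule can be sketched with a grouped transform. The toy records reproduce the two cases from the example above; the column names follow the cleaned dataset, and community B is a hypothetical community with no residents.

```python
import pandas as pd

# Community A has one suspicious 0 (and a 49); community B is genuinely empty.
records = pd.DataFrame({
    "Community": ["A"] * 5 + ["B"] * 5,
    "Date": ["2018-10"] * 10,
    "Residents": [50, 50, 50, 49, 0, 0, 0, 0, 0, 0],
})

# Replace every resident count by the maximum seen for the same community and month.
records["Residents"] = (
    records.groupby(["Community", "Date"])["Residents"].transform("max")
)

print(records)
```

Because the maximum of an all-zero group is still 0, places that really have no residents are left untouched.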
Data Conflicts:
1) three communities are assigned to the sector in which they were most recently categorized.
For example, Glenmore Park was in South from 2015 to 2019, but in West from 2012 to 2014. After modification, all crimes in Glenmore Park are assigned to the South sector.
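One way to apply the 'most recent sector' rule is to look up each community's sector in its latest year and map it back onto all rows. This is a sketch on toy rows that mirror the Glenmore Park example; the column names follow the cleaned dataset.

```python
import pandas as pd

crimes = pd.DataFrame({
    "Community": ["Glenmore Park"] * 4 + ["Thorncliffe"] * 2,
    "Sector": ["West", "West", "South", "South", "North", "North"],
    "Year": [2012, 2014, 2015, 2019, 2012, 2019],
})

# Sector recorded in each community's latest year.
latest_sector = (
    crimes.sort_values("Year")
    .drop_duplicates("Community", keep="last")
    .set_index("Community")["Sector"]
)

# Overwrite the sector of every row with that most recent value.
crimes["Sector"] = crimes["Community"].map(latest_sector)

print(crimes)
```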
Comparison between raw data and cleaned data:
Raw data:

| | Sector | Community Name | Group Category | Category | Crime Count | Resident Count | Date | Year | Month | ID | Community Center Point |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTH | THORNCLIFFE | Crime | Theft FROM Vehicle | 9 | 8474 | 03/01/2018 12:00:00 PM | 2018 | MAR | 2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9 | (51.103099554741, -114.068779421169) |
| 1 | SOUTH | WOODBINE | Crime | Theft FROM Vehicle | 3 | 8866 | 11/01/2019 12:00:00 AM | 2019 | NOV | 2019-NOV-WOODBINE-Theft FROM Vehicle-3 | (50.939610852207664, -114.12962865374453) |
| 2 | SOUTH | WILLOW PARK | Crime | Theft FROM Vehicle | 4 | 5328 | 11/01/2019 12:00:00 AM | 2019 | NOV | 2019-NOV-WILLOW PARK-Theft FROM Vehicle-4 | (50.95661926653037, -114.05620194518823) |
| 3 | SOUTH | WILLOW PARK | Crime | Commercial Robbery | 1 | 5328 | 11/01/2019 12:00:00 AM | 2019 | NOV | 2019-NOV-WILLOW PARK-Commercial Robbery-1 | (50.95661926653037, -114.05620194518823) |
| 4 | WEST | LINCOLN PARK | Crime | Commercial Break & Enter | 5 | 2617 | 11/01/2019 12:00:00 AM | 2019 | NOV | 2019-NOV-LINCOLN PARK-Commercial Break & Enter-5 | (51.0100906918158, -114.12955694059636) |

Cleaned data:

| | Sector | Community | Group | Category | Crimes | Date | Year | Month | Latitude | Longitude | Residents | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | North | Thorncliffe | Crime | Theft FROM Vehicle | 9 | 2018-03 | 2018 | 3 | 51.1031 | -114.068779 | 8474 | Elsewhere |
| 1 | North | Thorncliffe | Crime | Assault (Non-domestic) | 2 | 2018-03 | 2018 | 3 | 51.1031 | -114.068779 | 8474 | Elsewhere |
| 2 | North | Thorncliffe | Crime | Commercial Break & Enter | 2 | 2018-03 | 2018 | 3 | 51.1031 | -114.068779 | 8474 | Elsewhere |
| 3 | North | Thorncliffe | Disorder | Physical Disorder | 1 | 2018-03 | 2018 | 3 | 51.1031 | -114.068779 | 8474 | Elsewhere |
| 4 | North | Thorncliffe | Disorder | Social Disorder | 48 | 2018-03 | 2018 | 3 | 51.1031 | -114.068779 | 8474 | Elsewhere |
Double-check the cleaned data, compare it to the raw data, and save it as a new file: crime_data.csv.
All members of our group can then use the cleaned data to write their own parts.
Step 1. Univariate Analysis
Step 2. Bivariate Analysis
Step 3. Multivariate Analysis
The dataset contains three categories of variables; we first explore the distribution of each:
Crime data: group, category, crime count
Location data: sector, community, resident count
Time data: year, month (monthly, from 2012-01 to 2019-12)
Crime Data Breakdown:
---------------------
1. Crime records can be divided into 2 major groups: crime and disorder.
2. Don't panic! Most records in the crime dataset are actually disorders, not real crimes.

Guiding Question: What are general categories for crime records in Calgary?

Crime Data Breakdown:
---------------------
1. More than 70% of crime records are social disorder cases.
2. Among non-disorder crimes, theft from vehicle and theft of vehicle happened most frequently.

Guiding Question: What kind of crime happened most frequently?
Location Data Breakdown:
------------------------
1. There are approximately 300 communities in Calgary, categorized into 8 sectors.
2. Communities left blank have no crime records, e.g. University District.

Guiding Question: Can I have a map of Calgary's sectors and communities?

Location Data Breakdown:
------------------------
1. Calgary's population is distributed unevenly across communities.
2. From 2012 to 2019, the population increased in 133 communities and decreased in 84.

Guiding Question: How is the population distributed in Calgary, and how did it change?
It is unfair to use raw crime numbers to represent safety level, since communities with larger populations tend to have more crimes.
Crime Density ($\rho_{crime}$): the number of crimes per resident per year in a given area.
Warning: the resident count in places such as the airport and parks is 0, since nobody really lives there. For those places, calculating crime density is meaningless. Therefore, crime density is only used for communities with more than 500 residents.
Guiding Question: How can we correctly measure safety level by using crime data?
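Written as a formula, the definition above is $\rho_{crime} = N_{crime} / N_{residents}$ for one area and one year, where $N_{crime}$ and $N_{residents}$ are notation introduced here for the yearly crime count and the resident count. A minimal sketch of the calculation follows; the numbers are made up, and taking the largest resident count within a year as that year's population is an assumption.

```python
import pandas as pd

# Illustrative records only; column names follow the cleaned dataset.
crimes = pd.DataFrame({
    "Community": ["A", "A", "B", "B", "Airport"],
    "Year": [2019, 2019, 2019, 2019, 2019],
    "Crimes": [30, 20, 5, 5, 40],
    "Residents": [1000, 1000, 2000, 2000, 0],
})

MIN_RESIDENTS = 500  # threshold stated in the text

# Total crimes and resident count per community and year.
yearly = crimes.groupby(["Community", "Year"], as_index=False).agg(
    Crimes=("Crimes", "sum"),
    Residents=("Residents", "max"),  # assumption: largest count within the year
)

# Keep only communities where the density is meaningful, then divide.
yearly = yearly[yearly["Residents"] > MIN_RESIDENTS].copy()
yearly["Density"] = yearly["Crimes"] / yearly["Residents"]

print(yearly)
```

Filtering before dividing also avoids division by zero for places like the airport.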
Here we are going to explore relationships between different data groups:
Crime vs Time: crime distribution by year, and by month
Crime vs Location: crime distribution by sector, and by communities (maps will be provided for this part)
Crime vs Time:
--------------
1. Both total crimes and crime density values in Calgary have been increasing in recent years.
2. Safety worsened especially in 2015, probably due to the oil price collapse.

Guiding Question: Has Calgary become a safer place in recent years?

Crime vs Time:
--------------
1. Both crime numbers and density values peak in summer and are low in winter.
2. The seasonality of crime is quite obvious, since Calgary has extremely cold winters.

Guiding Question: Which time of year is relatively safe?
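A seasonal profile like the one described here can be obtained by totalling crimes per calendar month and then averaging those totals over the years. The sketch below uses made-up numbers purely to show the grouping; column names follow the cleaned dataset.

```python
import pandas as pd

# Illustrative counts only.
crimes = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2019, 2019, 2019],
    "Month": [1, 7, 7, 1, 7, 12],
    "Crimes": [10, 30, 20, 14, 40, 12],
})

# Total crimes in each (year, month), then the average total per calendar month.
monthly_totals = crimes.groupby(["Year", "Month"])["Crimes"].sum()
avg_by_month = monthly_totals.groupby(level="Month").mean()

print(avg_by_month)
print("Peak month:", avg_by_month.idxmax())
```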
Crime vs Location:
------------------
1. Based on total crime records, Centre Calgary is much more dangerous than other sectors.
2. Based on crime density, East and Centre Calgary have the lowest safety level.

Guiding Question: Which area in Calgary is relatively safe?

Crime vs Location:
------------------
1. Beltline and Downtown Commercial Core are in the downtown area; both belong to Centre Calgary.
2. Downtown Commercial Core stands out in terms of both crime number and crime density.

Guiding Question: Which area in Calgary is relatively unsafe?

Crime vs Location:
------------------
1. The downtown area is exceptionally unsafe, in terms of both crime number and density.
2. The crime density map gives a better comparison of safety level across communities.

Guiding Question: Can I have a crime map?

Crime vs Location:
------------------
1. Comparison is easier after excluding the most unsafe regions and applying the k-means algorithm.
2. In general, East Calgary is relatively unsafe, while Northwest, North and West look safer.

Guiding Question: Can I have a better crime map?
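The report does not state which features, number of clusters, or library were used for k-means, so the following is only a sketch of the idea: drop the extreme communities, then group the remaining crime densities into a few classes that can each receive one map colour. It runs Lloyd's algorithm on 1-D data in plain NumPy; the cut-off value, k = 3, and the density numbers are all hypothetical.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on 1-D data; returns (labels, centers)."""
    values = np.asarray(values, dtype=float)
    # Deterministic start: spread the initial centers over the data's quantiles.
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its members (keep it if it has none).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical crime densities; the last one plays the role of a downtown outlier.
density = np.array([0.01, 0.012, 0.011, 0.05, 0.055, 0.052, 0.12, 0.11, 0.9])

CUTOFF = 0.5  # hypothetical threshold for "most unsafe regions"
kept = density[density < CUTOFF]

labels, centers = kmeans_1d(kept, k=3)
print(labels)   # class index per remaining community
print(centers)  # mean density of each class
```

Removing the outlier first matters because a single extreme value would otherwise claim a cluster of its own and compress all other communities into the remaining classes.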
Crime vs Location (Visualization)

Guiding Question: Can I have an interactive crime map?
Finally, we are going to explore relationships between all 3 data groups:
Crime vs Time & Location: crime related parameters by locations during 2012-2019
Animations are provided throughout this part
1. Centre Calgary has become less safe in recent years, with increasing crime density.
2. From 2012 to 2019, safe sectors stayed safe and unsafe sectors stayed unsafe.

Guiding Question: I want to see how safety level changes by both location and time.

1. Crime density can fluctuate noticeably for a single community within a few years.
2. Further analysis focuses on downtown, since many crimes happened there.

Guiding Question: I want to see how safety level changes by both location and time.

1. Percentages for commercial break & enter and for assault are higher downtown.
2. Percentages for residential break & enter and theft of vehicle are lower downtown.
3. Commercial break & enter has outweighed non-domestic assault in the downtown area since 2019.

Guiding Question: How unsafe is the downtown area, and what is the major type of crime there?
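Percentages of this kind can be computed by pivoting crime counts by category and location and normalising each column. The sketch below uses invented counts, and the label 'Downtown' for the Location column is an assumption (the cleaned sample rows only show 'Elsewhere').

```python
import pandas as pd

# Illustrative counts only.
crimes = pd.DataFrame({
    "Location": ["Downtown"] * 3 + ["Elsewhere"] * 3,
    "Category": ["Assault (Non-domestic)", "Commercial Break & Enter",
                 "Theft FROM Vehicle"] * 2,
    "Crimes": [40, 40, 20, 10, 20, 70],
})

# Crime totals per category (rows) and location (columns).
counts = crimes.pivot_table(
    index="Category", columns="Location", values="Crimes",
    aggfunc="sum", fill_value=0,
)

# Share of each category within its location, in percent.
share = counts / counts.sum() * 100

print(share.round(1))
```

Comparing the two columns row by row shows which categories are over- or under-represented downtown relative to the rest of the city.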
Time-related:
1. Winter is relatively safe in Calgary, compared to other seasons.
2. Unfortunately, Calgary has become more dangerous in recent years, especially the downtown area.
Location-related:
1. Suburban areas such as Northwest, West and North Calgary are relatively safe.
2. Downtown and East Calgary are the most dangerous regions.
3. Theft of vehicle and residential break & enter account for a smaller percentage of crimes downtown, perhaps because more of the housing there is condos.
4. Downtown has a higher percentage of assault and commercial break & enter, perhaps due to more businesses and entertainment venues.
Crime-related:
1. In Calgary, around 70% of crime records are disorders.
2. Theft from vehicle is the most common crime, in both downtown and suburban areas.
3. By using crime density, we can partly offset the influence of population on crime numbers.
4. However, crime density does not work well for places with few residents.
Advanced-level:
1. We produced many maps for better illustration, including one interactive map.
2. Measured by crime density, regions that were relatively safe in 2012 were still relatively safe in 2019.
Here are a few things we can do for better research results:
1. Search for crime data from 2020 to 2021, to investigate safety during COVID-19.
2. Search for crime data with days and hours, to test which day of the week and which time of day are safe.
3. Search for crime data with the location of each crime, to investigate which blocks or buildings are unsafe.
4. Search for floating-population data for each community, to get better crime density results than resident counts allow.